Combining probabilistic tagging with rule-based
multilevel chunking for requirements elicitation


This paper describes a multi-layered NLP approach for eliciting ontology-relevant information from
free requirements text. To automate requirements elicitation from the textual information provided by
stakeholders, and to transform that information into a structured form suitable for validation, a
combination of probabilistic and rule-based NLP methods is proposed. The developed methodology has a
multi-level chunking strategy as its core principle.


Introduction 


Within the software development process, requirements elicitation is one of the most
complicated problems, and it has to be solved anew in every non-standard case.
Requirements information quite often exists only in an implicit and ambiguous
textual form. In this case the elicitation of domain-specific concepts poses many
difficulties for the requirements engineer. The software development process implies transforming
requirements from text specifications into special kinds of intermediate predesign models [1],
[2]. Different NLP (Natural Language Processing) techniques can be proposed for solving
this task [3].

The goal of the work was to develop an approach that processes free requirements text,
elicits the relevant requirements-driven information, and transforms it into a structured
view that end users of the system can validate and interact with. To reach this goal the
following tasks were stated:

— tokenization of texts;

— allocation of possible linguistic characteristics; 

— lemmatization; 

— sentence limits detection; 

— reduction of the characteristics of categories and disambiguation; 

— chunking; 

— implementation of heuristic rules, e.g. conversion of categories;

— output in XML format. 

Hence an NLP system was developed that helps the software designer trace implicit textual
information more easily. The methodology is a combined probabilistic and rule-based parser,
which processes free text using morphosyntactic, sentence-semantic and phrasal information
and makes the enriched result available in XML format.

The proposed methodology includes algorithms for tagset mapping, pre-chunking and
multi-level chunking of free English requirements text. The main layers are:

1. The tagging task, which is delegated to QTAG, a probabilistic tagger written in Java by
O. Mason [4].

2. The mapping engine, which we developed for splitting up and reinterpreting the standard
QTAG tagset.

3. The identification of compound nouns. We suppose that implicitness is very often caused
by the ambiguity of complex terms, e.g. the unclear structure of compounds or other
groups of words.

4. The extraction and generation of inflectional word forms. 

5. Some other morphological information extraction. 

6. Multi-word unit and idiomatic expression identification.

7. Verb subclass identification. 

8. Chunking heuristics for grouping words into morphological units and
syntactic chunks, which we chose as candidates for conceptualization nodes in the
ontology layer.


Related work 


Many open source systems are available for tagging free English text, such as the
decision-tree-based “TreeTagger” [5], the rule- and transformation-based “Brill tagger” [6],
the maximum-entropy “Stanford POS Tagger” [7] and the trigram-based probabilistic “QTAG” [8].
We chose “QTAG” because it is an extendable, trainable, language-independent tagger.

There are also approaches that include components for chunking, e.g. “MontyLingua” [9],
“MontyKlu” (an online version of “MontyLingua” developed by members of the research
group in Klagenfurt [10]) and the “NLTK Toolkit” [11]. These systems mainly provide
standardized and acceptable output, but as far as we know they have been developed for
educational purposes only. For the needs of ontology engineering they are not really useful.


General procedure 


In our NLP system (fig. 1) the data flows from the input state through the processing
blocks into the output state. The input is raw text in “.doc” or “.txt” format. The existing
free APIs QTokenizer [4] and QTAG are used for producing tagged text. Between the
QTokenizer and QTAG blocks there is the block “Tokenize Correction”, which corrects some
well-known tokenization problems, such as the possessive case of nouns (e.g. Peter’s), double
adjectives, numbers with points, etc.

For the transformation to the extended tagging format we use a mapping engine which 
is based on a set of mapping rules. 

After all possible information has been extracted from the QTAG output, we need to obtain
extra tagging information. We call the next processing step the pre-chunking procedure. It
corrects possibly wrong QTAG output, solves some simple ambiguity problems and identifies
certain verb subclasses according to internal heuristic rules. These procedures make the
natural language material ready for chunking. At this stage our system also performs
lemmatization and corelex noun identification. If the tagging result is wrong, one can create
additional lexicons with the needed entries of tags and their frequencies. These lexicons can
be used as “before-consulted” resources for the tagger, so that the text can be retagged to
obtain a corrected result.

The chunking procedure consists of the detection of multi-word units and idiomatic
expressions, the default chunking rules engine and elements of fine-grained chunking methods.
The chunking rules engine applies the rules step by step according to their level numbers, and
the chunk trees are built up accordingly.

Figure 1 — Conceptual scheme of our NLP system


As a result of this approach we receive an extended POS (part-of-speech) tagged text
in a tree view with the root node <text>. We store the output in XML format, which allows
us to review it with any XML editor and to interpret it in the next stage of the workflow.
Note that in principle all processing blocks work automatically in the default
configuration, but the user can also change the settings manually. These possibilities are
shown with dashed arrows.
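
As an illustration of the output stage, the following minimal sketch serializes tagged words into an XML tree with the root node <text>. Only the root name and the attribute names of the verb example come from this paper; the element name `sentence`, the function `to_xml` and the input layout are assumptions made for the example and do not reproduce the actual output schema of the system.

```python
import xml.etree.ElementTree as ET


def to_xml(sentences):
    """Serialize tagged sentences into an XML string rooted at <text>.

    `sentences` is a list of sentences; each sentence is a list of
    (word, main_class_tag, attribute_dict) triples (assumed layout).
    """
    root = ET.Element("text")
    for sentence in sentences:
        # The <sentence> element name is hypothetical.
        sent_el = ET.SubElement(root, "sentence")
        for word, tag, attrs in sentence:
            # One element per word, named after its main class tag (e.g. v0).
            word_el = ET.SubElement(sent_el, tag, attrs)
            word_el.text = word
    return ET.tostring(root, encoding="unicode")


example = [[("is", "v0", {"verbclass": "aux", "temp": "pres", "form": "ind",
                          "num": "sg", "ps": "3", "baseform": "be"})]]
xml_text = to_xml(example)
print(xml_text)
```

Because the result is ordinary XML, it can be opened in any XML editor or parsed again by the next stage of the workflow.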


Details of linguistic processing 


The mapping step. We use QTAG as the primary parsing and tagging engine. Its output
is needed for further processing. However, the QTAG output has to be adapted to the needs
of the requirements engineering workflow, since the two systems have different view
structures. So we propose that:

1. QTAG carries out the basic processing step. 

2. We extract relevant information from the QTAG output and transform it into the 
enriched tagset format. 

3. We use some additional methods and heuristics to elicit the semantic information
needed during the further processing steps of the requirements engineering workflow.

We propose that a mapping from the shallow, standardized QTAG tagset [4] to the
ontology-oriented tagset developed in the project NIBA [1], [12] is necessary for the
identification of ontological key relations and terms. The NIBA tagset consists of basic main
POS categories with arrays of attributes (e.g. v0 with the subclass attribute “tvag2”, a
monotransitive verb with agentive subject). Figure 2 shows how part-of-speech tags are
extracted from the QTAG output and reassigned using the NIBA tagset notation (e.g. v0, n0, a0, etc.).

Additional information about concrete part-of-speech instances is represented by
fine-grained attributes. As an example, the verb “is” gets the tag <BEZ> in QTAG. This
tag encodes that “is” is an auxiliary verb with the inherent morphosyntactic values present tense,
singular, third person, and with “be” as its base form. After mapping and restructuring
we receive the main class tag <v0> and an explicit set of attributes belonging to this tag:
verbclass="aux", temp="pres", form="ind", num="sg", ps="3", baseform="be".
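
The mapping step can be pictured as a table lookup. The sketch below is only an assumed, simplified form of such a mapping engine: the `BEZ` entry follows the example in the text, while the other entries, the fallback class `unknown` and the names `MAPPING_RULES` and `map_tag` are hypothetical.

```python
# Mapping table: QTAG tag -> (main class tag, attribute dict).
# Only the BEZ entry is taken from the text; the rest is illustrative.
MAPPING_RULES = {
    "BEZ": ("v0", {"verbclass": "aux", "temp": "pres", "form": "ind",
                   "num": "sg", "ps": "3", "baseform": "be"}),
    "NN": ("n0", {"num": "sg"}),   # hypothetical entry
    "JJ": ("a0", {}),              # hypothetical entry
}


def map_tag(word, qtag):
    """Return the enriched representation of one tagged word."""
    main_class, attrs = MAPPING_RULES.get(qtag, ("unknown", {}))
    # Copy the attributes so that later steps can modify them per word.
    return {"word": word, "tag": main_class, **attrs}


mapped = map_tag("is", "BEZ")
print(mapped)
```

Keeping the rules in a data table rather than in code matches the idea of a mapping engine driven by a set of mapping rules, and makes the table easy to extend.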


Figure 2 — Mapping process


Pre-Chunking. For the next step our aims are more sophisticated. We need
information about verb subclasses [12], about base forms, and about some other
inflectional parameters such as passive voice constructions. For solving this task
we used some heuristics based on basic English grammar rules.

Disambiguation of auxiliary verbs. Quite often auxiliary verbs (like “be” or “have”)
function as main verbs.

As a result of the mapping procedure our system by default assigns the value “aux” to all
inflectional forms of “to be”, “to do”, “to have” and the modal verbs. Consequently we may
need to reinterpret the subclass value “aux” as “copV” or “possV” (table 1),
depending on the concrete syntactic context.

For solving this problem the following heuristic procedure is chosen. The filtering
engine goes through all tagged words, identifies the auxiliary verbs and then checks the
contextual conditions.

The following formal definitions are relevant for this purpose:


$S$ is the set of all sentences, $S = \{S_j,\ j = \overline{1,n}\}$;

$W$ is the set of all output words, defined in the same way;
$T$ is the set of all output tags;

$V^{aux}$ is the set of enriched output tags that correspond to non-modal auxiliary verbs,
$$V^{aux} = \{v0_i : v0_i.verbclass = \text{"aux"} \wedge v0_i.form \neq \text{"modal"},\ v0_i \in V\},$$
where $V$ is the set of all verb tags.


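Thus, the rules can be written as follows (for $i = \overline{1, N_{V^{aux}}}$):

$$\forall v0_i : v0_i \in V^{aux},\ v0_i \in S_k;\ \nexists v0_j : v0_j \in S_k,\ v0_j.form \neq \text{"modal"} \rightarrow v0_i.verbclass = \text{copV}, \tag{1}$$

$$\forall v0_i : v0_i \in V^{aux},\ v0_i \in S_k,\ v0_i.baseform = \text{"have"};\ \nexists v0_j : v0_j \in S_k,\ v0_j.form \neq \text{"modal"} \rightarrow v0_i.verbclass = \text{possV}. \tag{2}$$

Here $v0_j$ is read as a verb of the sentence $S_k$ other than $v0_i$.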
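
Rules (1) and (2) can be sketched in a few lines. This is one possible reading, not the original implementation: the token layout (dicts with `tag`, `verbclass`, `form`, `baseform`) and the function name `disambiguate_aux` are assumptions, and the rules are interpreted as "no other non-modal verb occurs in the same sentence".

```python
def disambiguate_aux(sentence):
    """Reinterpret 'aux' as 'copV' or 'possV' in one sentence.

    `sentence` is a list of token dicts (assumed layout). The dicts are
    modified in place and the list is returned for convenience.
    """
    verbs = [t for t in sentence if t["tag"] == "v0"]
    for verb in verbs:
        is_aux = verb.get("verbclass") == "aux" and verb.get("form") != "modal"
        if not is_aux:
            continue
        # Another non-modal verb in the sentence means this one stays auxiliary.
        others = [u for u in verbs
                  if u is not verb and u.get("form") != "modal"]
        if others:
            continue
        # Rule (2) for "have", rule (1) otherwise.
        verb["verbclass"] = "possV" if verb.get("baseform") == "have" else "copV"
    return sentence


copula = disambiguate_aux([
    {"word": "Peter", "tag": "n0"},
    {"word": "is", "tag": "v0", "verbclass": "aux", "form": "ind", "baseform": "be"},
    {"word": "student", "tag": "n0"},
])
possessive = disambiguate_aux([
    {"word": "Peter", "tag": "n0"},
    {"word": "has", "tag": "v0", "verbclass": "aux", "form": "ind", "baseform": "have"},
    {"word": "car", "tag": "n0"},
])
progressive = disambiguate_aux([
    {"word": "Peter", "tag": "n0"},
    {"word": "is", "tag": "v0", "verbclass": "aux", "form": "ind", "baseform": "be"},
    {"word": "writing", "tag": "v0", "verbclass": "tvag2", "form": "ind", "baseform": "write"},
])
```

In the third sentence the main verb "writing" is present, so "is" keeps its auxiliary reading.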


Verb transitivity disambiguation. For requirements engineering purposes
we also need to identify other verb subclasses (Table 1).

Table 1 — Verb subclasses

No. | Abbreviation | Description
1 | aux | Auxiliary verb
2 | eV | Ergative verb
3 | iV | Intransitive verb
4 | lokV | Locative verb
5 | possV | Possessive verb
6 | psychV | Mental verb
7 | tvag2 | Monotransitive verb with agent subject
8 | tv3 | Ditransitive verb
9 | sentV | Perception verb
10 | copV | Copula verb
11 | tv2 | Monotransitive verb without agent subject


Due to the fixed and transparent subject-verb-object (SVO) structure of English,
verb transitivity identification is normally a quite simple and straightforward task.
Nevertheless we have to take into account that the phrasal structure very often rules out
simple solutions such as counting nouns.

In our approach we used the following algorithm to cope with the problem of phrasal 
complexity: 

1. Create a set of rules which can operate on simple singular term subjects and objects 
(e.g. proper nouns and personal pronouns). 

2. Consult the exception database with already-assigned verbal subclass tags using
training sentences which include default argument patterns.

3. Reconstruct the structure of the primarily assigned phrases if relevant
morphosyntactic features don’t fit.

4. Leave open the possibility to manually change wrong/exceptional assignments or 
to add new information about verb classes. 

Presupposing the default heuristic rules (1) and (2) defined above, we add the following
definitions for verb transitivity identification:


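$$\forall v0_i \in V,\ v0_i \notin V^{aux},\ v0_i \in S_k,\ \exists t_{i-1} = n0.type = \text{"proper"} \mid pron0.type = \text{"pers"} \Rightarrow$$

$$\exists w_{i+1} = \text{"in"} \mid \text{"on"},\ w_{i+1} \in S_k \rightarrow v0_i.verbclass = \text{"lokV"} \mid \tag{3}$$

$$\exists t_{i+1} = nP0,\ t_{i+1} \in S_k \rightarrow v0_i.verbclass = \text{"tvag2"} \mid \tag{4}$$

$$\exists t_{i+1} = nP0 \cup t_{i+2} = nP0;\ t_{i+1}, t_{i+2} \in S_k \rightarrow v0_i.verbclass = \text{"tv3"}. \tag{5}$$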


Note that nP0 (noun pseudo-object) denotes word groups which can include more than
one noun. We identify those groups as noun compounds which can be taken as
pseudo-objects for transitivity disambiguation. In the following example:

I write a system requirements document

the sequence of three nouns in the right context of the verb is interpreted as one argument
and consequently as a compound noun. That’s why we classify “write” as a monotransitive
verb. We call this process pseudo-object identification, because it doesn’t presuppose
any syntactic analysis.

In the next example sentence:

He gave Peter balls

we identify two nouns in object position representing two different objects, because these
nouns do not agree with respect to the proper/common opposition. Thus, we
classify “give” as a ditransitive verb.
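
The pseudo-object idea behind rules (3) to (5) can be sketched as follows. This is a simplified reading under stated assumptions: tokens are dicts with `word`, `tag` and `type`, a determiner is assumed to open a new pseudo-object, and the names `tok`, `group_pseudo_objects` and `classify_verb` are hypothetical. The exception database and the manual correction step are not modelled.

```python
def tok(word, tag, type_=None):
    """Build a token dict (assumed layout)."""
    return {"word": word, "tag": tag, "type": type_}


def group_pseudo_objects(tokens):
    """Group the right context of a verb into noun pseudo-objects (nP0)."""
    groups, current = [], []
    for t in tokens:
        if t["tag"] == "det0":
            # Assumption: a determiner closes the current group, if any.
            if current:
                groups.append(current)
                current = []
            continue
        if t["tag"] != "n0":
            break  # stop at the first token that is neither noun nor determiner
        if current and current[-1]["type"] != t["type"]:
            # Proper/common mismatch: the nouns belong to different objects.
            groups.append(current)
            current = []
        current.append(t)
    if current:
        groups.append(current)
    return groups


def classify_verb(sentence, i):
    """Return a verb subclass for sentence[i], or its current value."""
    verb = sentence[i]
    if verb["tag"] != "v0" or verb.get("verbclass") == "aux" or i == 0:
        return verb.get("verbclass")
    left = sentence[i - 1]
    simple_subject = (
        (left["tag"] == "n0" and left["type"] == "proper")
        or (left["tag"] == "pron0" and left["type"] == "pers")
    )
    if not simple_subject:
        return verb.get("verbclass")
    right = sentence[i + 1:]
    if right and right[0]["word"] in ("in", "on"):
        return "lokV"                      # rule (3)
    groups = group_pseudo_objects(right)
    if len(groups) >= 2:
        return "tv3"                       # rule (5)
    if len(groups) == 1:
        return "tvag2"                     # rule (4)
    return verb.get("verbclass")


s1 = [tok("I", "pron0", "pers"), tok("write", "v0"), tok("a", "det0"),
      tok("system", "n0", "common"), tok("requirements", "n0", "common"),
      tok("document", "n0", "common")]
s2 = [tok("He", "pron0", "pers"), tok("gave", "v0"),
      tok("Peter", "n0", "proper"), tok("balls", "n0", "common")]
s3 = [tok("She", "pron0", "pers"), tok("lives", "v0"),
      tok("in", "p0"), tok("London", "n0", "proper")]

print(classify_verb(s1, 1), classify_verb(s2, 1), classify_verb(s3, 1))
```

The first two sentences are the examples from the text; the third is an added illustration of the locative rule.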

Passive voice identification. For ontological design needs we have to detect passive 
constructions. 

In the default case passive constructions are built with the auxiliary verb “to be” and 
the participle II of a main verb. 

So, presupposing the definitions above, we define the following heuristic rule
(for $i = \overline{1, N_V}$):

$$\forall v0_i : v0_i \in V,\ v0_i \in S_k,\ v0_i.baseform = \text{"be"};\ \exists v0_{i+1} : v0_{i+1} \in S_k,\ v0_{i+1}.temp = \text{"perf"} \rightarrow v0_i.mode = \text{"pass"},\ v0_{i+1}.mode = \text{"pass"}. \tag{6}$$


During the development of this approach we concluded that an additional trigger for passive
identification is needed. It takes into account the reduced form, the so-called
participial construction:

$$\forall v0_i : v0_i \in V,\ v0_i.temp = \text{"perf"},\ \exists w_{i+1} \in PrepPassList \rightarrow v0_i.mode = \text{"pass"}. \tag{7}$$

This rule is used for identifying the reduced form, and only if the current
verb is used with one of the predefined prepositions. For example:

Parents or legal guardians are filling in an application provided by the day nursery.
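
A minimal sketch of rules (6) and (7) is given below. The token layout and the function name `mark_passive` are assumptions, and the content of `PREP_PASS_LIST` is a guess: the text only says that the prepositions are predefined, so the single entry "by" is taken from the example sentence.

```python
# Assumed content: the text does not enumerate the predefined prepositions.
PREP_PASS_LIST = {"by"}


def mark_passive(sentence):
    """Set mode='pass' on verbs in passive constructions (in place)."""
    for i, token in enumerate(sentence):
        if token["tag"] != "v0":
            continue
        nxt = sentence[i + 1] if i + 1 < len(sentence) else None
        if nxt is None:
            continue
        # Rule (6): a form of "be" directly followed by a participle II.
        if (token.get("baseform") == "be" and nxt["tag"] == "v0"
                and nxt.get("temp") == "perf"):
            token["mode"] = "pass"
            nxt["mode"] = "pass"
        # Rule (7): reduced participial construction before a listed preposition.
        if token.get("temp") == "perf" and nxt["word"] in PREP_PASS_LIST:
            token["mode"] = "pass"
    return sentence


fragment = mark_passive([
    {"word": "are", "tag": "v0", "baseform": "be", "temp": "pres"},
    {"word": "filling", "tag": "v0", "baseform": "fill", "temp": "pres"},
    {"word": "in", "tag": "pt0"},
    {"word": "application", "tag": "n0"},
    {"word": "provided", "tag": "v0", "baseform": "provide", "temp": "perf"},
    {"word": "by", "tag": "p0"},
    {"word": "nursery", "tag": "n0"},
])
full_passive = mark_passive([
    {"word": "is", "tag": "v0", "baseform": "be", "temp": "pres"},
    {"word": "written", "tag": "v0", "baseform": "write", "temp": "perf"},
])
```

In the fragment taken from the example sentence only "provided" is marked as passive: "are filling" is a progressive form, so rule (6) does not fire.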

Verbal base form generation. During the pre-chunking procedure we relate the
inflectional variants of a verb to the base form by keeping the main class tag the same
and varying the attributes that identify the verbal variants (e.g. v0 is assigned
to “see”, “saw”, “seen”; the type perception verb (“sentV”) is assigned to all three forms,
“past” is assigned to “saw”, “perf” is assigned to “seen”).

For this purpose the procedure of base form generation was developed (fig. 3). Verb
forms are extracted from a dynamically extendable dictionary with the following verbal
entry structure: <base-form>=<related form1><related form2>.... If a word is found in
the dictionary, the engine simply returns all needed information. If the searched word is not
in the dictionary, one of two possible strategies is executed. In one case default endings are
assigned according to the paradigm of regular verb forms. In the other case the user is offered
the possibility to interact with the system and to define base forms and/or inflected forms
manually (fig. 3).

[Figure 3: flowchart with the blocks “Verb”, “Base form generation”, “Look up in Dictionary”,
“Receive forms?” and “Results”. Entry fields: <base-form>, <3rd form>, <past form>,
<continuous form>, <participle II form>. Default endings: 3rd form (e)s, continuous ing,
past ed, participle II ed. Dictionary file: Dictionary.properties.]

Figure 3 — Verbal base form generation
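
The lookup-with-fallback strategy can be sketched as below. The in-memory `DICTIONARY`, its field names and the functions `regular_forms`, `verb_forms` and `base_form` are assumptions for illustration; the real system reads its entries from a dictionary file. The regular paradigm here uses only the default endings (e)s, ing and ed and deliberately ignores consonant doubling and the y/ie alternation.

```python
# Sample entry in an assumed in-memory layout.
DICTIONARY = {
    "see": {"third": "sees", "past": "saw",
            "continuous": "seeing", "participle2": "seen"},
}


def regular_forms(base):
    """Generate inflected forms with the default regular endings."""
    sibilant = base.endswith(("s", "sh", "ch", "x", "z", "o"))
    third = base + ("es" if sibilant else "s")
    past = base + ("d" if base.endswith("e") else "ed")
    if base.endswith("e") and not base.endswith("ee"):
        continuous = base[:-1] + "ing"   # provide -> providing
    else:
        continuous = base + "ing"
    return {"third": third, "past": past,
            "continuous": continuous, "participle2": past}


def verb_forms(base, dictionary=DICTIONARY):
    """Return the dictionary entry if present, else the regular paradigm."""
    return dictionary.get(base) or regular_forms(base)


def base_form(word, dictionary=DICTIONARY):
    """Look up the base form of a known inflected form, else None."""
    for base, forms in dictionary.items():
        if word == base or word in forms.values():
            return base
    return None


print(verb_forms("see"))
print(verb_forms("provide"))
```

The interactive branch, in which the user supplies the forms, would simply add a new entry to the dictionary.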


The common chunking rules. Based on some variants of the X-bar Theory [13] and
on some core definitions in the project NIBA [1], [12] we composed a set of chunking rules
for English for the production of syntactically and morphosyntactically motivated chunks
(Table 2).

Table 2 — Excerpt from chunking rules

Rule (Summands → Result) | Level | Rule description | Example
n0+n0 → n0 | 1 | Compound Noun | blood pressure
[pt0]+a0 → a2 | 1 | Adjective Phrase | very nice
[a0]+a0 → a2 | 1 | Adjective Phrase | bright green
[pt0]+q0 → q2 | 1 | Quantor Phrase | very many
[q0]+q0 → q2 | 1 | Quantor Phrase | one million
[pt0]+adv0 → adv2 | 1 | Adverb Phrase | very often
[adv0]+adv0 → adv2 | 1 | Adverb Phrase | yesterday noon
pron0(type=pers) → n3 | 1 | Noun Phrase | she
v0(verbclass=aux)+[adv0]+v0 → v0(type=complex) | 1 | Complex Verb | will certainly go
v0(verbclass=aux)+pt0(type=neg)+v0 → v0(type=complex) | 1 | Complex Verb | would not write
v0+pt0(type=verbal) → v0(type=complex) | 1 | Complex Verb | wake up
q2+q2 → q2 | 2 | Quantor Phrase | [illegible]
pron0(type=poss)+n0 → n3 | 2 | Noun Phrase | his mother
[det0]+[a2]+[q2]+n0 → n3 | 3 | Noun Phrase | the nice two girls
[det0]+[q2]+[a2]+n0 → n3 | 3 | Noun Phrase | the three busy scientists
p0+n3 → p2 | 4 | Prepositional Phrase | [illegible] pressure measurement


There are several types of chunking rules, arranged in a certain order that
must be followed during the chunking process. Summands are the array of input nodes
needed for building the resulting upper node of the chunking tree. Some summands
are strictly required for a rule to fire and are written without square brackets;
others are optional and are placed inside square brackets.

The chunking process starts on the first level of the rule system, e.g. with the
identification of compound nouns and complex verbs, and climbs up the rule hierarchy, ending
at the fourth level with the identification of prepositional phrases. The construction
of compound nouns is a very important precondition for ontology elicitation,
because compounds normally serve as specialization terms in domain ontologies. In general
we think that the number of simple nouns involved in a compound correlates with the degree
of specialization.
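
To make the level-wise processing concrete, here is a minimal sketch of a chunking rules engine. It is not the original engine: it uses only a subset of the rules from Table 2, ignores attribute conditions such as type=pers, and picks the longest match at each position, which is an assumption of this example. The names `RULES`, `match` and `chunk` are hypothetical.

```python
# (level, summands, result); each summand is (tag, optional).
RULES = [
    (1, [("n0", False), ("n0", False)], "n0"),
    (1, [("pt0", True), ("a0", False)], "a2"),
    (1, [("pt0", True), ("q0", False)], "q2"),
    (2, [("q2", False), ("q2", False)], "q2"),
    (3, [("det0", True), ("a2", True), ("q2", True), ("n0", False)], "n3"),
    (3, [("det0", True), ("q2", True), ("a2", True), ("n0", False)], "n3"),
    (4, [("p0", False), ("n3", False)], "p2"),
]


def match(pattern, nodes, start):
    """Return how many nodes the pattern consumes at nodes[start:], or 0."""
    def rec(p, n):
        if p == len(pattern):
            return n - start
        tag, optional = pattern[p]
        if n < len(nodes) and nodes[n]["tag"] == tag:
            consumed = rec(p + 1, n + 1)
            if consumed:
                return consumed
        # An optional summand may be skipped; a required one may not.
        return rec(p + 1, n) if optional else 0
    return rec(0, start)


def chunk(nodes, rules=RULES):
    """Apply the rules level by level and return the list of top nodes."""
    nodes = list(nodes)
    for level in sorted({rule[0] for rule in rules}):
        level_rules = [rule for rule in rules if rule[0] == level]
        changed = True
        while changed:                      # repeat until the level is stable
            changed = False
            i = 0
            while i < len(nodes):
                length, result = max(
                    ((match(pat, nodes, i), res) for _, pat, res in level_rules),
                    key=lambda m: m[0],
                )
                if length:
                    children = nodes[i:i + length]
                    nodes[i:i + length] = [{"tag": result, "children": children}]
                    changed = True
                i += 1
    return nodes


def leaves(tags):
    return [{"tag": t} for t in tags]


pp = chunk(leaves(["p0", "det0", "n0", "n0", "n0"]))   # e.g. "of the blood pressure measurement"
np1 = chunk(leaves(["det0", "a0", "q0", "n0"]))        # "the nice two girls"
np2 = chunk(leaves(["det0", "q0", "a0", "n0"]))        # "the three busy scientists"
```

In the first call the three nouns are merged into one compound on level 1, the determiner is added on level 3, and the preposition on level 4, so a single p2 node remains.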

Multi-word units. Multi-word units are phrases that function grammatically as single
words, e.g. the conjunction “so that” or the preposition “in spite of”. They receive a single
POS tag, so they are treated here as single words.

Before the common chunking rules engine starts, the phase of multi-word unit
assignment is run. Since these collocations have only a grammatical function, we don’t
combine them into a new node of a higher level, but only assign the relevant attributes to
the parts of these units. The algorithm of this procedure is the same as for the chunking
rules, except that no new node is created. Particle (or phrasal) verbs are one
particular case of multi-word units.

Idiomatic expressions identification. Building up the chunking tree begins with a
comparison against predefined lists of node phrases. In English there are a lot of stable set
expressions, e.g. idiomatic expressions. That’s why for good phrase-based parsing it is very
important to use these expressions for identifying POS phrases on a much higher level. For this
purpose we keep extendable lists for different POS nodes and compare the entries from these lists
with our sentence constructs. If such a construct is found, a new node on the next
level is created.

Fine-grained chunking rules. After all default simple chunking rules and predefined
patterns of idiomatic expressions have been applied, the chunked tree of the raw text is still
not complete. English, like every modern language, has a lot of complex internal structures,
whose identification is a very important task in NLP today. For requirements elicitation
it is also very useful to extract some of these structures, turns of speech and others.
These tasks can be implemented using predefined heuristics based on the common
rules of English usage. Below we consider the structures that are relevant for this work and
the methods that help to identify them:

Preposition phrase as the post-modifier of the noun phrase. In our approach the
following heuristic method is proposed (8):

$$\forall s_i : s_i \in S^{cur},\ s_i = p2,\ s_i.head \in N3P2List,$$

$$\exists s_{i-1} : s_{i-1} \in S^{cur},\ s_{i-1} = n3 \rightarrow s_{i-1}.children_{last} = s_i, \tag{8}$$

$$\exists s_{i-1} : s_{i-1} \in S^{cur},\ s_{i-1} = p2 \rightarrow s_{i-1}.children_{last}.children_{last} = s_i.$$

In this rule the notation of the previous definitions is used, together with the following additions:

— $s_i.head$: the child of the node $s_i$ that is the head of the phrase. In this case it is
the preposition;

— $N3P2List$: the list of prepositions which signal the post-modifier case. It
consists of the following: “of”, “for”, “with”, “to”, “in”, “as”;

— $children_{last}$: the element of the array “children” with the highest index.

Note that the identification of post-modifier prepositional phrases should be realized
as a LIFO process: the searching engine must go through the sentence from right to left.
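
The right-to-left attachment of rule (8) can be sketched as follows. The node layout (dicts with `tag`, `head` and `children`) and the function name `attach_postmodifiers` are assumptions; the preposition list is the one given in the text. Note that the sketch modifies the node dicts in place.

```python
N3P2_LIST = {"of", "for", "with", "to", "in", "as"}


def attach_postmodifiers(nodes):
    """Attach post-modifier p2 nodes to the preceding n3, scanning right to left."""
    nodes = list(nodes)
    i = len(nodes) - 1
    while i > 0:
        cur, prev = nodes[i], nodes[i - 1]
        if cur["tag"] == "p2" and cur.get("head") in N3P2_LIST:
            if prev["tag"] == "n3":
                # The p2 becomes the last child of the noun phrase.
                prev["children"].append(cur)
                del nodes[i]
            elif prev["tag"] == "p2":
                # The p2 attaches to the noun phrase inside the previous p2.
                prev["children"][-1]["children"].append(cur)
                del nodes[i]
        i -= 1
    return nodes


def n3(word):
    return {"tag": "n3", "children": [{"tag": "n0", "word": word}]}


def p2(prep, noun_phrase):
    return {"tag": "p2", "head": prep,
            "children": [{"tag": "p0", "word": prep}, noun_phrase]}


# "measurement of pressure in patients" (illustrative phrase)
result = attach_postmodifiers([n3("measurement"),
                               p2("of", n3("pressure")),
                               p2("in", n3("patients"))])
```

Because the scan starts at the right end, the last prepositional phrase is attached first, and the chain of post-modifiers ends up nested inside a single noun phrase.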

Parenthesis. A parenthesis is an explanatory or qualifying word, clause, or sentence
inserted into a passage with which it doesn’t necessarily have any grammatical connection,
and from which it is usually marked off by round or square brackets, dashes, or commas.
A parenthesis can be a noun group or a whole sentence construct. For definiteness
we confine ourselves to using brackets for parenthesis identification:


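$$\forall \{s_j\} : \{s_j \in S^{cur},\ j = \overline{l,m}\},\ s_{l-1} = \text{'('},\ s_{m+1} = \text{')'}:$$

$$\exists s_j \in \{s_j\} : s_j = v0 \rightarrow s_{new} = sentence,\ s_{new}.type = parenthesis,\ s_{new}.children = \text{'('} \cup \{s_j\} \cup \text{')'}, \tag{9}$$

$$\nexists s_j \in \{s_j\} : s_j = v0 \rightarrow s_{new} = n3,\ s_{new}.type = parenthesis,\ s_{new}.children = \text{'('} \cup \{s_j\} \cup \text{')'}$$

$$\Rightarrow s_{l-2}.children_{last} = s_{new}.$$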


Homogeneous parts. Homogeneous parts are parts of the same category standing in the same
relation to other parts of the sentence in “contracted” sentences:

$$\forall \{s_i\} : \{s_i\} \in S^{cur},\ i = \overline{1,n},$$

$$\exists s_j : s_j = \text{','} \vee s_j = con0,\ s_j.type = coord,\ j = i\ (i \bmod 2 = 0), \tag{10}$$

$$\exists s_j : s_j = s_{j+2},\ j = i\ (i \bmod 2 \neq 0) \rightarrow s_{new} = s_j,\ s_{new}.children = \{s_i\},\ s_{new}.type = homogeneous.$$

Descriptions of relative clauses, infinitive groups, subordinate sentences and the
post-chunking verification of verb subclass assignment are omitted here for brevity.


Conclusion and further work 


NLP-driven ontology engineering is certainly one of the key technologies in the
requirements engineering realm of the upcoming decade [14-16]. The proposed approach, based
on combining probabilistic POS tagging with multilevel chunking, makes it possible to proceed
from free requirements texts to a special ontology-oriented linguistic model enriched with
a set of elicited attributes, relations and other characteristics. Parts of our approach are
based on the algorithms described in this paper. The procedures involved are
heuristically founded and follow a multi-level chunking strategy.

This extended ontology-oriented linguistic model can be used further to generate the
conceptual predesign requirements model [2] and to fill its glossaries with concepts and model
elements. For these actions interpretation transformations should be used.

N.A. Bazhenov

Combining probabilistic tagging with rule-based multilevel syntactic analysis
for requirements elicitation

The article describes an NLP approach to eliciting the relevant ontological information
from raw requirements texts. To automate the elicitation of requirements from the textual
information of stakeholders, and to transform it into a structured form suitable for
validation, the use of a combination of probabilistic and rule-based NLP methods is proposed.
The developed methodology includes a multilevel syntactic analysis strategy as its main principle.

The article was received by the editors on 01.04.2010.